Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Among renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brakes, etc.).
“ReneWind” is a company working to improve the machinery and processes involved in wind energy production using machine learning, and it has collected sensor data on generator failures of wind turbines. The company has shared a ciphered version of the data, as data collected through sensors is confidential (the type of data collected varies by company). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one to identify failures so that generators can be repaired before they fail or break, reducing overall maintenance cost. The nature of the predictions made by the classification model will translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
A value of “1” in the target variable represents “failure” and “0” represents “no failure”.
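The cost ordering above (replacement > repair > inspection) can be sketched as a toy cost function over confusion-matrix counts. The numeric costs below are illustrative assumptions, not values from the problem statement; they only preserve the stated ordering.

```python
# Illustrative only: the exact costs are not given, but the ordering
# replacement > repair > inspection is stated in the problem.
COST_REPLACE = 100  # missed failure (false negative) -> generator breaks
COST_REPAIR = 20    # correctly predicted failure (true positive) -> repair
COST_INSPECT = 5    # false alarm (false positive) -> inspection only

def maintenance_cost(tn, fp, fn, tp):
    """Total cost implied by a confusion matrix under the assumed costs."""
    return fn * COST_REPLACE + tp * COST_REPAIR + fp * COST_INSPECT

# A model that misses fewer failures (lower fn) is cheaper overall even if
# it raises more false alarms, which motivates optimizing for recall.
print(maintenance_cost(tn=940, fp=10, fn=40, tp=10))  # 4250
print(maintenance_cost(tn=900, fp=50, fn=5, tp=45))   # 1650
```

Under any costs with this ordering, reducing false negatives dominates the total maintenance bill.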
import pandas as pd
import numpy as np
#for visualizations
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
#for missing value imputation
from sklearn.impute import SimpleImputer
# for model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
RandomForestClassifier,
BaggingClassifier,
AdaBoostClassifier,
GradientBoostingClassifier
)
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import (
accuracy_score, recall_score, f1_score, precision_score, confusion_matrix, roc_auc_score
)
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
#for model tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
#for pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
#link google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
data = pd.read_csv('/content/drive/MyDrive/Project ReneWind/Train.csv.csv')
df = data.copy()
df.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.465 | -4.679 | 3.102 | 0.506 | -0.221 | -2.033 | -2.911 | 0.051 | -1.522 | 3.762 | -5.715 | 0.736 | 0.981 | 1.418 | -3.376 | -3.047 | 0.306 | 2.914 | 2.270 | 4.395 | -2.388 | 0.646 | -1.191 | 3.133 | 0.665 | -2.511 | -0.037 | 0.726 | -3.982 | -1.073 | 1.667 | 3.060 | -1.690 | 2.846 | 2.235 | 6.667 | 0.444 | -2.369 | 2.951 | -3.480 | 0 |
| 1 | 3.366 | 3.653 | 0.910 | -1.368 | 0.332 | 2.359 | 0.733 | -4.332 | 0.566 | -0.101 | 1.914 | -0.951 | -1.255 | -2.707 | 0.193 | -4.769 | -2.205 | 0.908 | 0.757 | -5.834 | -3.065 | 1.597 | -1.757 | 1.766 | -0.267 | 3.625 | 1.500 | -0.586 | 0.783 | -0.201 | 0.025 | -1.795 | 3.033 | -2.468 | 1.895 | -2.298 | -1.731 | 5.909 | -0.386 | 0.616 | 0 |
| 2 | -3.832 | -5.824 | 0.634 | -2.419 | -1.774 | 1.017 | -2.099 | -3.173 | -2.082 | 5.393 | -0.771 | 1.107 | 1.144 | 0.943 | -3.164 | -4.248 | -4.039 | 3.689 | 3.311 | 1.059 | -2.143 | 1.650 | -1.661 | 1.680 | -0.451 | -4.551 | 3.739 | 1.134 | -2.034 | 0.841 | -1.600 | -0.257 | 0.804 | 4.086 | 2.292 | 5.361 | 0.352 | 2.940 | 3.839 | -4.309 | 0 |
| 3 | 1.618 | 1.888 | 7.046 | -1.147 | 0.083 | -1.530 | 0.207 | -2.494 | 0.345 | 2.119 | -3.053 | 0.460 | 2.705 | -0.636 | -0.454 | -3.174 | -3.404 | -1.282 | 1.582 | -1.952 | -3.517 | -1.206 | -5.628 | -1.818 | 2.124 | 5.295 | 4.748 | -2.309 | -3.963 | -6.029 | 4.949 | -3.584 | -2.577 | 1.364 | 0.623 | 5.550 | -1.527 | 0.139 | 3.101 | -1.277 | 0 |
| 4 | -0.111 | 3.872 | -3.758 | -2.983 | 3.793 | 0.545 | 0.205 | 4.849 | -1.855 | -6.220 | 1.998 | 4.724 | 0.709 | -1.989 | -2.633 | 4.184 | 2.245 | 3.734 | -6.313 | -5.380 | -0.887 | 2.062 | 9.446 | 4.490 | -3.945 | 4.582 | -8.780 | -3.383 | 5.107 | 6.788 | 2.044 | 8.266 | 6.629 | -10.069 | 1.223 | -3.230 | 1.687 | -2.164 | -3.645 | 6.510 | 0 |
test_data = pd.read_csv('/content/drive/MyDrive/Project ReneWind/Test.csv.csv')
test_data.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613 | -3.820 | 2.202 | 1.300 | -1.185 | -4.496 | -1.836 | 4.723 | 1.206 | -0.342 | -5.123 | 1.017 | 4.819 | 3.269 | -2.984 | 1.387 | 2.032 | -0.512 | -1.023 | 7.339 | -2.242 | 0.155 | 2.054 | -2.772 | 1.851 | -1.789 | -0.277 | -1.255 | -3.833 | -1.505 | 1.587 | 2.291 | -5.411 | 0.870 | 0.574 | 4.157 | 1.428 | -10.511 | 0.455 | -1.448 | 0 |
| 1 | 0.390 | -0.512 | 0.527 | -2.577 | -1.017 | 2.235 | -0.441 | -4.406 | -0.333 | 1.967 | 1.797 | 0.410 | 0.638 | -1.390 | -1.883 | -5.018 | -3.827 | 2.418 | 1.762 | -3.242 | -3.193 | 1.857 | -1.708 | 0.633 | -0.588 | 0.084 | 3.014 | -0.182 | 0.224 | 0.865 | -1.782 | -2.475 | 2.494 | 0.315 | 2.059 | 0.684 | -0.485 | 5.128 | 1.721 | -1.488 | 0 |
| 2 | -0.875 | -0.641 | 4.084 | -1.590 | 0.526 | -1.958 | -0.695 | 1.347 | -1.732 | 0.466 | -4.928 | 3.565 | -0.449 | -0.656 | -0.167 | -1.630 | 2.292 | 2.396 | 0.601 | 1.794 | -2.120 | 0.482 | -0.841 | 1.790 | 1.874 | 0.364 | -0.169 | -0.484 | -2.119 | -2.157 | 2.907 | -1.319 | -2.997 | 0.460 | 0.620 | 5.632 | 1.324 | -1.752 | 1.808 | 1.676 | 0 |
| 3 | 0.238 | 1.459 | 4.015 | 2.534 | 1.197 | -3.117 | -0.924 | 0.269 | 1.322 | 0.702 | -5.578 | -0.851 | 2.591 | 0.767 | -2.391 | -2.342 | 0.572 | -0.934 | 0.509 | 1.211 | -3.260 | 0.105 | -0.659 | 1.498 | 1.100 | 4.143 | -0.248 | -1.137 | -5.356 | -4.546 | 3.809 | 3.518 | -3.074 | -0.284 | 0.955 | 3.029 | -1.367 | -3.412 | 0.906 | -2.451 | 0 |
| 4 | 5.828 | 2.768 | -1.235 | 2.809 | -1.642 | -1.407 | 0.569 | 0.965 | 1.918 | -2.775 | -0.530 | 1.375 | -0.651 | -1.679 | -0.379 | -4.443 | 3.894 | -0.608 | 2.945 | 0.367 | -5.789 | 4.598 | 4.450 | 3.225 | 0.397 | 0.248 | -2.362 | 1.079 | -0.473 | 2.243 | -3.591 | 1.774 | -1.502 | -2.227 | 4.777 | -6.560 | -0.806 | -0.276 | -3.858 | -0.538 | 0 |
df.shape
(20000, 41)
The training data set consists of 20,000 rows and 41 columns.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20000 entries, 0 to 19999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 19982 non-null float64 1 V2 19982 non-null float64 2 V3 20000 non-null float64 3 V4 20000 non-null float64 4 V5 20000 non-null float64 5 V6 20000 non-null float64 6 V7 20000 non-null float64 7 V8 20000 non-null float64 8 V9 20000 non-null float64 9 V10 20000 non-null float64 10 V11 20000 non-null float64 11 V12 20000 non-null float64 12 V13 20000 non-null float64 13 V14 20000 non-null float64 14 V15 20000 non-null float64 15 V16 20000 non-null float64 16 V17 20000 non-null float64 17 V18 20000 non-null float64 18 V19 20000 non-null float64 19 V20 20000 non-null float64 20 V21 20000 non-null float64 21 V22 20000 non-null float64 22 V23 20000 non-null float64 23 V24 20000 non-null float64 24 V25 20000 non-null float64 25 V26 20000 non-null float64 26 V27 20000 non-null float64 27 V28 20000 non-null float64 28 V29 20000 non-null float64 29 V30 20000 non-null float64 30 V31 20000 non-null float64 31 V32 20000 non-null float64 32 V33 20000 non-null float64 33 V34 20000 non-null float64 34 V35 20000 non-null float64 35 V36 20000 non-null float64 36 V37 20000 non-null float64 37 V38 20000 non-null float64 38 V39 20000 non-null float64 39 V40 20000 non-null float64 40 Target 20000 non-null int64 dtypes: float64(40), int64(1) memory usage: 6.3 MB
There are 40 independent variables, all of type float. The target variable is a binary of type int, with 1 meaning a failure and 0 meaning no failure.
df.describe(include='all').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| V1 | 19982.000 | -0.272 | 3.442 | -11.876 | -2.737 | -0.748 | 1.840 | 15.493 |
| V2 | 19982.000 | 0.440 | 3.151 | -12.320 | -1.641 | 0.472 | 2.544 | 13.089 |
| V3 | 20000.000 | 2.485 | 3.389 | -10.708 | 0.207 | 2.256 | 4.566 | 17.091 |
| V4 | 20000.000 | -0.083 | 3.432 | -15.082 | -2.348 | -0.135 | 2.131 | 13.236 |
| V5 | 20000.000 | -0.054 | 2.105 | -8.603 | -1.536 | -0.102 | 1.340 | 8.134 |
| V6 | 20000.000 | -0.995 | 2.041 | -10.227 | -2.347 | -1.001 | 0.380 | 6.976 |
| V7 | 20000.000 | -0.879 | 1.762 | -7.950 | -2.031 | -0.917 | 0.224 | 8.006 |
| V8 | 20000.000 | -0.548 | 3.296 | -15.658 | -2.643 | -0.389 | 1.723 | 11.679 |
| V9 | 20000.000 | -0.017 | 2.161 | -8.596 | -1.495 | -0.068 | 1.409 | 8.138 |
| V10 | 20000.000 | -0.013 | 2.193 | -9.854 | -1.411 | 0.101 | 1.477 | 8.108 |
| V11 | 20000.000 | -1.895 | 3.124 | -14.832 | -3.922 | -1.921 | 0.119 | 11.826 |
| V12 | 20000.000 | 1.605 | 2.930 | -12.948 | -0.397 | 1.508 | 3.571 | 15.081 |
| V13 | 20000.000 | 1.580 | 2.875 | -13.228 | -0.224 | 1.637 | 3.460 | 15.420 |
| V14 | 20000.000 | -0.951 | 1.790 | -7.739 | -2.171 | -0.957 | 0.271 | 5.671 |
| V15 | 20000.000 | -2.415 | 3.355 | -16.417 | -4.415 | -2.383 | -0.359 | 12.246 |
| V16 | 20000.000 | -2.925 | 4.222 | -20.374 | -5.634 | -2.683 | -0.095 | 13.583 |
| V17 | 20000.000 | -0.134 | 3.345 | -14.091 | -2.216 | -0.015 | 2.069 | 16.756 |
| V18 | 20000.000 | 1.189 | 2.592 | -11.644 | -0.404 | 0.883 | 2.572 | 13.180 |
| V19 | 20000.000 | 1.182 | 3.397 | -13.492 | -1.050 | 1.279 | 3.493 | 13.238 |
| V20 | 20000.000 | 0.024 | 3.669 | -13.923 | -2.433 | 0.033 | 2.512 | 16.052 |
| V21 | 20000.000 | -3.611 | 3.568 | -17.956 | -5.930 | -3.533 | -1.266 | 13.840 |
| V22 | 20000.000 | 0.952 | 1.652 | -10.122 | -0.118 | 0.975 | 2.026 | 7.410 |
| V23 | 20000.000 | -0.366 | 4.032 | -14.866 | -3.099 | -0.262 | 2.452 | 14.459 |
| V24 | 20000.000 | 1.134 | 3.912 | -16.387 | -1.468 | 0.969 | 3.546 | 17.163 |
| V25 | 20000.000 | -0.002 | 2.017 | -8.228 | -1.365 | 0.025 | 1.397 | 8.223 |
| V26 | 20000.000 | 1.874 | 3.435 | -11.834 | -0.338 | 1.951 | 4.130 | 16.836 |
| V27 | 20000.000 | -0.612 | 4.369 | -14.905 | -3.652 | -0.885 | 2.189 | 17.560 |
| V28 | 20000.000 | -0.883 | 1.918 | -9.269 | -2.171 | -0.891 | 0.376 | 6.528 |
| V29 | 20000.000 | -0.986 | 2.684 | -12.579 | -2.787 | -1.176 | 0.630 | 10.722 |
| V30 | 20000.000 | -0.016 | 3.005 | -14.796 | -1.867 | 0.184 | 2.036 | 12.506 |
| V31 | 20000.000 | 0.487 | 3.461 | -13.723 | -1.818 | 0.490 | 2.731 | 17.255 |
| V32 | 20000.000 | 0.304 | 5.500 | -19.877 | -3.420 | 0.052 | 3.762 | 23.633 |
| V33 | 20000.000 | 0.050 | 3.575 | -16.898 | -2.243 | -0.066 | 2.255 | 16.692 |
| V34 | 20000.000 | -0.463 | 3.184 | -17.985 | -2.137 | -0.255 | 1.437 | 14.358 |
| V35 | 20000.000 | 2.230 | 2.937 | -15.350 | 0.336 | 2.099 | 4.064 | 15.291 |
| V36 | 20000.000 | 1.515 | 3.801 | -14.833 | -0.944 | 1.567 | 3.984 | 19.330 |
| V37 | 20000.000 | 0.011 | 1.788 | -5.478 | -1.256 | -0.128 | 1.176 | 7.467 |
| V38 | 20000.000 | -0.344 | 3.948 | -17.375 | -2.988 | -0.317 | 2.279 | 15.290 |
| V39 | 20000.000 | 0.891 | 1.753 | -6.439 | -0.272 | 0.919 | 2.058 | 7.760 |
| V40 | 20000.000 | -0.876 | 3.012 | -11.024 | -2.940 | -0.921 | 1.120 | 10.654 |
| Target | 20000.000 | 0.056 | 0.229 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
All columns consist of numeric values with varying distributions. Further exploration is needed, since the variable meanings are withheld for confidentiality.
df.duplicated().sum()
0
No duplicate data.
df.isnull().sum()
V1 18 V2 18 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
There are 18 missing values in each of the first two columns (V1 and V2); these will be imputed.
test_data.shape
(5000, 41)
The test set consists of 5000 rows and 41 columns.
test_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 4995 non-null float64 1 V2 4994 non-null float64 2 V3 5000 non-null float64 3 V4 5000 non-null float64 4 V5 5000 non-null float64 5 V6 5000 non-null float64 6 V7 5000 non-null float64 7 V8 5000 non-null float64 8 V9 5000 non-null float64 9 V10 5000 non-null float64 10 V11 5000 non-null float64 11 V12 5000 non-null float64 12 V13 5000 non-null float64 13 V14 5000 non-null float64 14 V15 5000 non-null float64 15 V16 5000 non-null float64 16 V17 5000 non-null float64 17 V18 5000 non-null float64 18 V19 5000 non-null float64 19 V20 5000 non-null float64 20 V21 5000 non-null float64 21 V22 5000 non-null float64 22 V23 5000 non-null float64 23 V24 5000 non-null float64 24 V25 5000 non-null float64 25 V26 5000 non-null float64 26 V27 5000 non-null float64 27 V28 5000 non-null float64 28 V29 5000 non-null float64 29 V30 5000 non-null float64 30 V31 5000 non-null float64 31 V32 5000 non-null float64 32 V33 5000 non-null float64 33 V34 5000 non-null float64 34 V35 5000 non-null float64 35 V36 5000 non-null float64 36 V37 5000 non-null float64 37 V38 5000 non-null float64 38 V39 5000 non-null float64 39 V40 5000 non-null float64 40 Target 5000 non-null int64 dtypes: float64(40), int64(1) memory usage: 1.6 MB
As with the training data, all predictor columns are of type float and the target column is of type int.
test_data.duplicated().sum()
0
No duplicated data.
test_data.isnull().sum()
V1 5 V2 6 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
There are a few missing values in the first two columns (5 in V1 and 6 in V2).
# Function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined on a shared x-axis
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # two rows: boxplot on top, histogram below
        sharex=True,  # x-axis shared between the two subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    # Boxplot; the star marker indicates the mean value of the column
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
    # Histogram; pass bins only when explicitly provided
    hist_kwargs = {"bins": bins} if bins else {}
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, **hist_kwargs)
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")  # mean
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  # median
round(df['V1'].median(),2)
-0.75
histogram_boxplot(df, 'V1')
The distribution for V1 is slightly skewed to the right. There are observable outliers with most values being centered around -0.7.
round(df['V2'].median(),2)
0.47
histogram_boxplot(df, 'V2')
The distribution for V2 resembles a normal distribution with outliers on both sides. The data is centered around 0.5.
round(df['V3'].median(),2)
2.26
histogram_boxplot(df, 'V3')
The distribution for V3 resembles a normal distribution with outliers present on both sides. The values are centered around 2.3.
round(df['V4'].median(),2)
-0.14
histogram_boxplot(df, 'V4')
The distribution for V4 resembles a normal distribution with outliers present on both sides. The data is centered around -0.1.
round(df['V5'].median(),2)
-0.1
histogram_boxplot(df, 'V5')
The distribution of V5 resembles a normal distribution with outliers present on both sides. The data is centered around -0.1.
round(df['V6'].median(),2)
-1.0
histogram_boxplot(df, 'V6')
The distribution for V6 resembles a normal distribution with outliers present on both sides. The data is centered around -1.0.
round(df['V7'].median(),2)
-0.92
histogram_boxplot(df, 'V7')
The distribution of V7 resembles a normal distribution with outliers present on both sides. The data is centered around -0.9.
round(df['V8'].median(),2)
-0.39
histogram_boxplot(df, 'V8')
The distribution for V8 resembles a normal distribution with outliers present on both sides. The data is centered around -0.4.
round(df['V9'].median(),2)
-0.07
histogram_boxplot(df, 'V9')
The distribution for V9 resembles a normal distribution with outliers present on both sides. The data is centered around -0.1.
round(df['V10'].median(),2)
0.1
histogram_boxplot(df, 'V10')
The distribution of V10 resembles a normal distribution with outliers present on both sides. The data is centered around 0.1.
round(df['V11'].median(),2)
-1.92
histogram_boxplot(df, 'V11')
The distribution for V11 resembles a normal distribution with outliers present on both sides. The data is centered around -1.9.
round(df['V12'].median(),2)
1.51
histogram_boxplot(df, 'V12')
The distribution for V12 resembles a normal distribution with outliers present on both sides. The data is centered around 1.5.
round(df['V13'].median(),2)
1.64
histogram_boxplot(df, 'V13')
The distribution for V13 resembles a normal distribution with outliers present on both sides. The data is centered around 1.6.
round(df['V14'].median(),2)
-0.96
histogram_boxplot(df, 'V14')
The distribution for V14 resembles a normal distribution with outliers present on both sides. The data is centered around -1.0.
round(df['V15'].median(),2)
-2.38
histogram_boxplot(df, 'V15')
The distribution for V15 resembles a normal distribution with outliers present on both sides. The data is centered around -2.4.
round(df['V16'].median(),2)
-2.68
histogram_boxplot(df, 'V16')
The distribution for V16 is slightly skewed but still resembles a normal distribution with outliers present. The data is centered around -2.7.
round(df['V17'].median(),2)
-0.01
histogram_boxplot(df, 'V17')
The distribution for V17 resembles a normal distribution with outliers present. The data is centered around 0.0.
round(df['V18'].median(),2)
0.88
histogram_boxplot(df, 'V18')
The distribution for V18 is slightly skewed to the right with outliers present. The data is centered around 0.9.
round(df['V19'].median(),2)
1.28
histogram_boxplot(df, 'V19')
The distribution for V19 resembles a normal distribution with outliers present on both sides. The data is centered around 1.3.
round(df['V20'].median(),2)
0.03
histogram_boxplot(df, 'V20')
The distribution for V20 resembles a normal distribution with outliers present on both sides. The data is centered around 0.0.
round(df['V21'].median(),2)
-3.53
histogram_boxplot(df, 'V21')
The distribution for V21 resembles a normal distribution with outliers present on both sides. The data is centered around -3.5.
round(df['V22'].median(),2)
0.97
histogram_boxplot(df, 'V22')
The distribution of V22 resembles a normal distribution with outliers present on both sides. The data is centered around 1.0.
round(df['V23'].median(),2)
-0.26
histogram_boxplot(df, 'V23')
The distribution of V23 resembles a normal distribution with outliers present on both sides. The data is centered around -0.3.
round(df['V24'].median(),2)
0.97
histogram_boxplot(df, 'V24')
The distribution of V24 resembles a normal distribution with outliers present on both sides. The data is centered around 1.0.
round(df['V25'].median(),2)
0.03
histogram_boxplot(df, 'V25')
The distribution of V25 resembles a normal distribution with outliers present. The data is centered around 0.0.
round(df['V26'].median(),2)
1.95
histogram_boxplot(df, 'V26')
The distribution of V26 resembles a normal distribution with outliers present on both sides. The data is centered around 2.0.
round(df['V27'].median(),2)
-0.88
histogram_boxplot(df, 'V27')
The distribution for V27 is slightly skewed to the right but still resembles a normal distribution with outliers present on both sides. The data is centered around -0.9.
round(df['V28'].median(),2)
-0.89
histogram_boxplot(df, 'V28')
The distribution of V28 resembles a normal distribution with outliers on both sides. The data is centered around -0.9.
round(df['V29'].median(),2)
-1.18
histogram_boxplot(df, 'V29')
The distribution for V29 is slightly skewed right with outliers present on both sides. The data is centered around -1.2.
round(df['V30'].median(),2)
0.18
histogram_boxplot(df, 'V30')
The distribution for V30 is slightly skewed left with outliers present on both sides. The data is centered around 0.2.
round(df['V31'].median(),2)
0.49
histogram_boxplot(df, 'V31')
The distribution of V31 resembles a normal distribution with outliers present on both sides. The data is centered around 0.5.
round(df['V32'].median(),2)
0.05
histogram_boxplot(df, 'V32')
The distribution for V32 is slightly skewed right but still resembles a normal distribution with outliers present on both sides. The data is centered around 0.1.
round(df['V33'].median(),2)
-0.07
histogram_boxplot(df, 'V33')
The distribution of V33 resembles a normal distribution with outliers present on both sides. The data is centered around -0.1.
round(df['V34'].median(),2)
-0.26
histogram_boxplot(df, 'V34')
The distribution for V34 is slightly skewed left but still resembles a normal distribution with outliers present on both sides. The data is centered around -0.3.
round(df['V35'].median(),2)
2.1
histogram_boxplot(df, 'V35')
The distribution for V35 is slightly skewed right but still resembles a normal distribution with outliers present on both sides. The data is centered around 2.1.
round(df['V36'].median(),2)
1.57
histogram_boxplot(df, 'V36')
The distribution for V36 resembles a normal distribution with outliers present on both sides. The data is centered around 1.6.
round(df['V37'].median(),2)
-0.13
histogram_boxplot(df, 'V37')
The distribution for V37 is slightly skewed right with outliers present on both sides. The data is centered around -0.1.
round(df['V38'].median(),2)
-0.32
histogram_boxplot(df, 'V38')
The distribution of V38 resembles a normal distribution with outliers present on both sides. The data is centered around -0.3.
round(df['V39'].median(),2)
0.92
histogram_boxplot(df, 'V39')
The distribution of V39 resembles a normal distribution with outliers present on both sides. The data is centered around 0.9.
round(df['V40'].median(),2)
-0.92
histogram_boxplot(df, 'V40')
The distribution of V40 resembles a normal distribution with outliers present on both sides. The data is centered around -0.9.
df['Target'].value_counts()
0    18890
1     1110
Name: Target, dtype: int64
print('The percentage of non-failures: ', 18890/df['Target'].count())
print('The percentage of failures: ', 1110/df['Target'].count())
The percentage of non-failures:  0.9445
The percentage of failures:  0.0555
sns.countplot(data = df, x = 'Target')
<Axes: xlabel='Target', ylabel='count'>
Approximately 6% of the observations in the dataset are failures and approximately 94% are non-failures. The classes are clearly imbalanced.
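As a quick sketch of checking the imbalance without hard-coding the class counts, `value_counts(normalize=True)` returns the class proportions directly; shown here on a hypothetical toy series with the same 94/6 split, not the project data.

```python
import pandas as pd

# Hypothetical mini-series with a 94/6 imbalance, mirroring the data
target = pd.Series([0] * 94 + [1] * 6, name="Target")

# normalize=True yields proportions instead of raw counts
proportions = target.value_counts(normalize=True)
print(proportions)  # 0 -> 0.94, 1 -> 0.06
```

The same call on `df['Target']` would report the 94.45% / 5.55% split without manual division.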
# Plot each predictor column with the histogram-boxplot helper defined above
for feature in df.columns.drop('Target'):
    histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None)
df1 = df.copy()
test_data1 = test_data.copy()
#Separate X and Y in train
X = df1.drop('Target', axis=1)
y = df1['Target']
#Separate X and Y for test data
X_test = test_data1.drop('Target', axis=1)
y_test = test_data1['Target']
#Split train csv into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
#Print shape of train and validation
print(X_train.shape, X_val.shape)
(16000, 40) (4000, 40)
print(y_train.value_counts() / y_train.count())
print("-" * 30)
print(y_val.value_counts() / y_val.count())
0   0.945
1   0.056
Name: Target, dtype: float64
------------------------------
0   0.945
1   0.056
Name: Target, dtype: float64
The class ratio is the same in the train and validation sets, as expected from the stratified split.
#Using imputer
imputer = SimpleImputer(strategy = 'median')
# Fit and transform the train data
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
# Transform the validation data
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)
# Transform the test data (impute with medians learned from the training data)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_train.columns)
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 dtype: int64 ------------------------------ V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 dtype: int64
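One way to guarantee the imputer is fitted only once, on training data, is to wrap it in a `Pipeline`; `transform` on validation or test data then always reuses the training medians. A minimal sketch on assumed toy arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data (assumed for illustration): one NaN per set
X_tr = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
X_te = np.array([[np.nan, 8.0]])

pipe = Pipeline([("imputer", SimpleImputer(strategy="median"))])
pipe.fit(X_tr)               # medians learned from train only: [2.0, 4.0]
print(pipe.transform(X_te))  # NaN filled with the train median 2.0
```

Calling `fit_transform` on the test set, by contrast, would silently re-learn statistics from the test data.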
Which metric to optimize?
A missed failure (false negative) is the costliest outcome, since an unrepaired generator must be replaced, while a false alarm (false positive) only incurs an inspection cost. We therefore want to maximize recall: the greater the recall, the fewer actual failures the model misses.
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
function for the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("lr", LogisticRegression(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.7196280073636767
lr: 0.48988129245223133
Random Forest: 0.7195899193804354
Bagging: 0.7083222243382213
AdaBoost: 0.6215641465117756
GBM: 0.7173363803719928
XGBoost: 0.810804291246112

Validation Performance:

dtree: 0.7387387387387387
lr: 0.49099099099099097
Random Forest: 0.7432432432432432
Bagging: 0.7207207207207207
AdaBoost: 0.6576576576576577
GBM: 0.7432432432432432
XGBoost: 0.8153153153153153
CPU times: user 7min 42s, sys: 965 ms, total: 7min 43s
Wall time: 6min 58s
#boxplot with cv scores to compare distribution of scores
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
Comparing boxplots of CV scores for the models on the original data, Decision Tree, Random Forest, Bagging, Gradient Boosting, and XGBoost perform best.
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("lr", LogisticRegression(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))
results2 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset with oversampled data:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results2.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)  # fit on the oversampled training data
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset with oversampled data:

dtree: 0.9732668119313808
lr: 0.8812865538044636
Random Forest: 0.9855744607906776
Bagging: 0.9781630048735123
AdaBoost: 0.8935280870047044
GBM: 0.9239674518302545
XGBoost: 0.9906035856141958

Validation Performance:

dtree: 0.7387387387387387
lr: 0.49099099099099097
Random Forest: 0.7432432432432432
Bagging: 0.7207207207207207
AdaBoost: 0.6576576576576577
GBM: 0.7432432432432432
XGBoost: 0.8153153153153153

CPU times: user 12min 31s, sys: 1.76 s, total: 12min 33s
Wall time: 11min
#boxplot with cv scores to compare distribution of scores
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results2)
ax.set_xticklabels(names)
plt.show()
When comparing boxplots of the CV scores for the models built on the oversampled data, Decision Tree, Random Forest, Bagging, Gradient Boosting, and XGBoost again perform the best. Oversampling makes the models prone to overfitting, which shows up as the large drop from the cross-validation scores to the validation scores; we can address this overfitting during tuning.
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
%%time
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("lr", LogisticRegression(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))
results3 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results3.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)  # fit on the undersampled training data
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.8468355233923697
lr: 0.8513235574176348
Random Forest: 0.8975052370976957
Bagging: 0.8704627689963816
AdaBoost: 0.8715927124992063
GBM: 0.8907446200723672
XGBoost: 0.8930108550752237

Validation Performance:

dtree: 0.7387387387387387
lr: 0.49099099099099097
Random Forest: 0.7432432432432432
Bagging: 0.7207207207207207
AdaBoost: 0.6576576576576577
GBM: 0.7432432432432432
XGBoost: 0.8153153153153153

CPU times: user 2min 5s, sys: 476 ms, total: 2min 5s
Wall time: 1min 52s
#boxplot with cv scores to compare distribution of scores
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results3)
ax.set_xticklabels(names)
plt.show()
When comparing boxplots of the CV scores for the models built on the undersampled data, Random Forest, Bagging, Gradient Boosting, and XGBoost perform the best. Comparing the training scores to the validation scores again shows a degree of overfitting, which can be addressed during tuning.
Hyperparameter tuning can take a long time to run, so to keep the runtime manageable you can use the following grids (one per model) wherever required.
# Gradient Boosting
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
# AdaBoost
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Bagging
param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}
# Random Forest
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    # flattened so each candidate is a single value, not a whole array
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# Decision Tree
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
# Logistic Regression
param_grid = {"C": np.arange(0.1, 1.1, 0.1)}
# XGBoost
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
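All the searches below pass a `scorer` object that is defined outside this excerpt. Given that the business goal is to miss as few real failures as possible, a recall-based scorer is the natural assumption; a minimal sketch of what it would look like:

```python
from sklearn.metrics import make_scorer, recall_score

# Assumed definition of the `scorer` used by every cross_val_score and
# RandomizedSearchCV call in this notebook: recall on the positive
# (failure) class, so tuning favours models that catch the most failures.
scorer = make_scorer(recall_score)
```

With this scorer, `RandomizedSearchCV` ranks candidate parameter sets by mean cross-validated recall rather than accuracy.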
Sample tuning method for Decision tree with original data
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.5675998222560782:
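Rather than retyping the best parameters into a fresh classifier by hand, the fitted search object already holds a refit copy via `best_estimator_`. A self-contained sketch on toy data (the `make_classification` data and the tiny grid are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for X_train/y_train.
X_toy, y_toy = make_classification(n_samples=200, random_state=1)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": np.arange(2, 6)},
    n_iter=4, cv=3, random_state=1,
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already refit on all of
# X_toy with the winning parameters, so it can be used directly.
tuned = search.best_estimator_
print(tuned.get_params()["max_depth"] == search.best_params_["max_depth"])  # True
```

Copying parameters manually, as the cells below do, also works, but it is easy to forget one (such as `random_state`) in the process.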
dtree_tuned = DecisionTreeClassifier(
min_samples_leaf = 7,
min_impurity_decrease = 0.0001,
max_leaf_nodes = 15,
max_depth = 5
)
dtree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, min_samples_leaf=7)
dtree_grid = model_performance_classification_sklearn(dtree_tuned, X_train, y_train)
dtree_grid
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.974 | 0.593 | 0.904 | 0.717 |
dtree_grid_val = model_performance_classification_sklearn(dtree_tuned, X_val, y_val)
dtree_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.969 | 0.577 | 0.810 | 0.674 |
confusion_matrix_sklearn(dtree_tuned, X_val, y_val)
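The two helper functions used above are defined outside this excerpt; hedged sketches of assumed equivalents, so the metric tables and confusion matrices below are reproducible (the real notebook draws the confusion matrix as a heatmap):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

def model_performance_classification_sklearn(model, X, y):
    """Return a one-row DataFrame of Accuracy/Recall/Precision/F1."""
    pred = model.predict(X)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(y, pred),
            "Recall": recall_score(y, pred),
            "Precision": precision_score(y, pred),
            "F1": f1_score(y, pred),
        },
        index=[0],
    )

def confusion_matrix_sklearn(model, X, y):
    """Print the raw confusion matrix for the model's predictions."""
    print(confusion_matrix(y, model.predict(X)))
```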
Sample tuning method for Decision tree with oversampled data
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 3} with CV score=0.9143060712783726:
dtree_tuned_over = DecisionTreeClassifier(
min_samples_leaf = 7,
min_impurity_decrease = 0.001,
max_leaf_nodes = 15,
max_depth = 3
)
dtree_tuned_over.fit(X_train_over, y_train_over)
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, min_samples_leaf=7)
dtree_grid_over = model_performance_classification_sklearn(dtree_tuned_over, X_train_over, y_train_over)
dtree_grid_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.838 | 0.917 | 0.792 | 0.850 |
dtree_grid_val_over = model_performance_classification_sklearn(dtree_tuned_over, X_val, y_val)
dtree_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.752 | 0.874 | 0.168 | 0.281 |
confusion_matrix_sklearn(dtree_tuned_over, X_val, y_val)
Sample tuning method for Decision tree with undersampled data
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,20),
'min_samples_leaf': [1, 2, 5, 7],
'max_leaf_nodes' : [5, 10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 5, 'max_depth': 14} with CV score=0.8492287183393639:
dtree_tuned_under = DecisionTreeClassifier(
min_samples_leaf = 1,
min_impurity_decrease = 0.001,
max_leaf_nodes = 5,
max_depth = 14
)
dtree_tuned_under.fit(X_train_un, y_train_un)
DecisionTreeClassifier(max_depth=14, max_leaf_nodes=5,
                       min_impurity_decrease=0.001)
dtree_grid_under = model_performance_classification_sklearn(dtree_tuned_under, X_train_un, y_train_un)
dtree_grid_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.854 | 0.902 | 0.823 | 0.861 |
dtree_grid_val_under = model_performance_classification_sklearn(dtree_tuned_under, X_val, y_val)
dtree_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.768 | 0.878 | 0.178 | 0.296 |
confusion_matrix_sklearn(dtree_tuned_under, X_val, y_val)
Sample tuning method for Random Forest with original data
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    # flattened so each candidate is a single value, not a whole array
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.7038786262934045:
rf_tuned = RandomForestClassifier(
n_estimators = 300,
min_samples_leaf = 1,
max_samples = 0.6,
max_features = 'sqrt'
)
rf_tuned.fit(X_train, y_train)
RandomForestClassifier(max_samples=0.6, n_estimators=300)
rf_grid = model_performance_classification_sklearn(rf_tuned, X_train, y_train)
rf_grid
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.995 | 0.909 | 1.000 | 0.952 |
rf_grid_val = model_performance_classification_sklearn(rf_tuned, X_val, y_val)
rf_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.985 | 0.734 | 0.982 | 0.840 |
confusion_matrix_sklearn(rf_tuned, X_val, y_val)
Sample tuning method for Random Forest with over sampled data
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    # flattened so each candidate is a single value, not a whole array
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9808099737442019:
rf_tuned_over = RandomForestClassifier(
n_estimators = 300,
min_samples_leaf = 1,
max_samples = 0.6,
max_features = 'sqrt'
)
rf_tuned_over.fit(X_train_over, y_train_over)
RandomForestClassifier(max_samples=0.6, n_estimators=300)
rf_grid_over = model_performance_classification_sklearn(rf_tuned_over, X_train, y_train)
rf_grid_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 0.999 | 0.999 |
rf_grid_val_over = model_performance_classification_sklearn(rf_tuned_over, X_val, y_val)
rf_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.988 | 0.856 | 0.918 | 0.886 |
confusion_matrix_sklearn(rf_tuned_over, X_val, y_val)
Sample tuning method for Random Forest with under sampled data
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    # flattened so each candidate is a single value, not a whole array
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 250, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.8941979305529106:
rf_tuned_under = RandomForestClassifier(
n_estimators = 250,
min_samples_leaf = 2,
max_samples = 0.5,
max_features = 'sqrt'
)
rf_tuned_under.fit(X_train_un, y_train_un)
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=250)
rf_grid_under = model_performance_classification_sklearn(rf_tuned_under, X_train_un, y_train_un)
rf_grid_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.962 | 0.931 | 0.993 | 0.961 |
rf_grid_val_under = model_performance_classification_sklearn(rf_tuned_under, X_val, y_val)
rf_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.935 | 0.883 | 0.457 | 0.602 |
confusion_matrix_sklearn(rf_tuned_under, X_val, y_val)
Sample tuning method for Bagging with original data
%%time
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = { 'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70], }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 30, 'max_samples': 0.9, 'max_features': 0.9} with CV score=0.728648511394655:
CPU times: user 37.7 s, sys: 258 ms, total: 38 s
Wall time: 21min 33s
bagging_tuned = BaggingClassifier(
n_estimators = 30,
max_samples = 0.9,
max_features = 0.9
)
bagging_tuned.fit(X_train, y_train)
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=30)
bagging_grid = model_performance_classification_sklearn(bagging_tuned, X_train, y_train)
bagging_grid
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.999 | 0.973 | 1.000 | 0.986 |
bagging_grid_val = model_performance_classification_sklearn(bagging_tuned, X_val, y_val)
bagging_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.984 | 0.739 | 0.965 | 0.837 |
confusion_matrix_sklearn(bagging_tuned, X_val, y_val)
Sample tuning method for Bagging with over sampled data
%%time
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = { 'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70], }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.9, 'max_features': 0.9} with CV score=0.9835892615034132:
CPU times: user 1min 58s, sys: 1.11 s, total: 1min 59s
Wall time: 30min 41s
bagging_tuned_over = BaggingClassifier(
n_estimators = 70,
max_samples = 0.9,
max_features = 0.9
)
bagging_tuned_over.fit(X_train_over, y_train_over)
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=70)
bagging_grid_over = model_performance_classification_sklearn(bagging_tuned_over, X_train_over, y_train_over)
bagging_grid_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
bagging_grid_val_over = model_performance_classification_sklearn(bagging_tuned_over, X_val, y_val)
bagging_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.984 | 0.860 | 0.857 | 0.858 |
confusion_matrix_sklearn(bagging_tuned_over, X_val, y_val)
Sample tuning method for Bagging with under sampled data
%%time
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = { 'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70], }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.7} with CV score=0.8953215260585285:
CPU times: user 2.79 s, sys: 62 ms, total: 2.85 s
Wall time: 1min 3s
bagging_tuned_under = BaggingClassifier(
n_estimators = 70,
max_samples = 0.8,
max_features = 0.7
)
bagging_tuned_under.fit(X_train_un, y_train_un)
BaggingClassifier(max_features=0.7, max_samples=0.8, n_estimators=70)
bagging_grid_under = model_performance_classification_sklearn(bagging_tuned_under, X_train_un, y_train_un)
bagging_grid_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997 | 0.994 | 0.999 | 0.997 |
bagging_grid_val_under = model_performance_classification_sklearn(bagging_tuned_under, X_val, y_val)
bagging_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.943 | 0.896 | 0.491 | 0.635 |
confusion_matrix_sklearn(bagging_tuned_under, X_val, y_val)
Sample tuning method for GradientBoost with original data
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.7602678854821303:
CPU times: user 15.3 s, sys: 188 ms, total: 15.5 s
Wall time: 5min 59s
gboost_tuned = GradientBoostingClassifier(
subsample = 0.7,
n_estimators = 125,
max_features = 0.5,
learning_rate = 0.2
)
gboost_tuned.fit(X_train, y_train)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
                           n_estimators=125, subsample=0.7)
gboost_grid = model_performance_classification_sklearn(gboost_tuned, X_train, y_train)
gboost_grid
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.994 | 0.912 | 0.978 | 0.944 |
gboost_grid_val = model_performance_classification_sklearn(gboost_tuned, X_val, y_val)
gboost_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.978 | 0.766 | 0.829 | 0.796 |
confusion_matrix_sklearn(gboost_tuned, X_val, y_val)
Sample tuning method for GradientBoost with over sampled data
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 1} with CV score=0.9677077328831046:
CPU times: user 30.1 s, sys: 451 ms, total: 30.5 s
Wall time: 11min 48s
gboost_tuned_over = GradientBoostingClassifier(
subsample = 0.7,
n_estimators = 125,
max_features = 0.5,
learning_rate = 1
)
gboost_tuned_over.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=1, max_features=0.5,
                           n_estimators=125, subsample=0.7)
gboost_grid_over = model_performance_classification_sklearn(gboost_tuned_over, X_train_over, y_train_over)
gboost_grid_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.993 | 0.992 | 0.993 | 0.993 |
gboost_grid_val_over = model_performance_classification_sklearn(gboost_tuned_over, X_val, y_val)
gboost_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.969 | 0.842 | 0.678 | 0.751 |
confusion_matrix_sklearn(gboost_tuned_over, X_val, y_val)
Sample tuning method for GradientBoost with under sampled data
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.9031993905922681:
CPU times: user 1.59 s, sys: 66.8 ms, total: 1.66 s
Wall time: 42.3 s
gboost_tuned_under = GradientBoostingClassifier(
subsample = 0.7,
n_estimators = 125,
max_features = 0.5,
learning_rate = 0.2
)
gboost_tuned_under.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
                           n_estimators=125, subsample=0.7)
gboost_grid_under = model_performance_classification_sklearn(gboost_tuned_under, X_train_un, y_train_un)
gboost_grid_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.995 | 0.992 | 0.998 | 0.995 |
gboost_grid_val_under = model_performance_classification_sklearn(gboost_tuned_under, X_val, y_val)
gboost_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.935 | 0.883 | 0.456 | 0.601 |
confusion_matrix_sklearn(gboost_tuned_under, X_val, y_val)
Sample tuning method for XGBoost with original data
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.8536469243953533:
CPU times: user 53.2 s, sys: 862 ms, total: 54 s
Wall time: 19min 23s
xgboost_tuned = XGBClassifier(
subsample = 0.8,
scale_pos_weight = 10,
n_estimators = 200,
learning_rate = 0.1,
gamma = 5
)
xgboost_tuned.fit(X_train, y_train)
XGBClassifier(gamma=5, learning_rate=0.1, n_estimators=200,
              scale_pos_weight=10, subsample=0.8, ...)
xgboost_grid = model_performance_classification_sklearn(xgboost_tuned, X_train, y_train)
xgboost_grid
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.999 | 1.000 | 0.987 | 0.993 |
xgboost_grid_val = model_performance_classification_sklearn(xgboost_tuned, X_val, y_val)
xgboost_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.989 | 0.856 | 0.931 | 0.892 |
confusion_matrix_sklearn(xgboost_tuned, X_val, y_val)
Sample tuning method for XGBoost with over sampled data
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9958972606443475:
CPU times: user 1min 41s, sys: 2.25 s, total: 1min 43s
Wall time: 37min 8s
xgboost_tuned_over = XGBClassifier(
subsample = 0.8,
scale_pos_weight = 10,
n_estimators = 200,
learning_rate = 0.1,
gamma = 5
)
xgboost_tuned_over.fit(X_train_over, y_train_over)
XGBClassifier(gamma=5, learning_rate=0.1, n_estimators=200,
              scale_pos_weight=10, subsample=0.8, ...)
xgboost_grid_over = model_performance_classification_sklearn(xgboost_tuned_over, X_train_over, y_train_over)
xgboost_grid_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.996 | 1.000 | 0.993 | 0.996 |
xgboost_grid_val_over = model_performance_classification_sklearn(xgboost_tuned_over, X_val, y_val)
xgboost_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.974 | 0.878 | 0.712 | 0.786 |
confusion_matrix_sklearn(xgboost_tuned_over, X_val, y_val)
Sample tuning method for XGBoost with under sampled data
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9223386021710149:
CPU times: user 6.19 s, sys: 152 ms, total: 6.34 s
Wall time: 2min 33s
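The `scorer` object passed to `RandomizedSearchCV` above is defined earlier in the notebook. A plausible definition (an assumption on my part), given that recall is the metric prioritized throughout this analysis:

```python
from sklearn.metrics import make_scorer, recall_score

# Score candidate parameter sets by recall, so the search favors models
# that miss as few true failures as possible
scorer = make_scorer(recall_score)
```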
xgboost_tuned_under = XGBClassifier(
    random_state=1,
    subsample=0.9,
    scale_pos_weight=10,
    n_estimators=200,
    learning_rate=0.1,
    gamma=5,
)
xgboost_tuned_under.fit(X_train_un, y_train_un)
xgboost_grid_under = model_performance_classification_sklearn(xgboost_tuned_under, X_train_un, y_train_un)
xgboost_grid_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.995 | 1.000 | 0.990 | 0.995 |
xgboost_grid_val_under = model_performance_classification_sklearn(xgboost_tuned_under, X_val, y_val)
xgboost_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.869 | 0.923 | 0.287 | 0.438 |
confusion_matrix_sklearn(xgboost_tuned_under, X_val, y_val)
# training performance comparison
models_train_comp_df = pd.concat(
[
dtree_grid.T,
dtree_grid_over.T,
dtree_grid_under.T,
rf_grid.T,
rf_grid_over.T,
rf_grid_under.T,
bagging_grid.T,
bagging_grid_over.T,
bagging_grid_under.T,
gboost_grid.T,
gboost_grid_over.T,
gboost_grid_under.T,
xgboost_grid.T,
xgboost_grid_over.T,
xgboost_grid_under.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree Tuned with original data",
"Decision Tree Tuned with over sampled data",
"Decision Tree Tuned with under sampled data",
"Random Forest Tuned with original data",
"Random Forest Tuned with over sampled data",
"Random Forest Tuned with under sampled data",
"Bagging Tuned with original data",
"Bagging Tuned with over sampled data",
"Bagging Tuned with under sampled data",
"Gradient Boost Tuned with original data",
"Gradient Boost Tuned with over sampled data",
"Gradient Boost Tuned with under sampled data",
"XGBoost Tuned with original data",
"XGBoost Tuned with over sampled data",
"XGBoost Tuned with under sampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Decision Tree Tuned with original data | Decision Tree Tuned with over sampled data | Decision Tree Tuned with under sampled data | Random Forest Tuned with original data | Random Forest Tuned with over sampled data | Random Forest Tuned with under sampled data | Bagging Tuned with original data | Bagging Tuned with over sampled data | Bagging Tuned with under sampled data | Gradient Boost Tuned with original data | Gradient Boost Tuned with over sampled data | Gradient Boost Tuned with under sampled data | XGBoost Tuned with original data | XGBoost Tuned with over sampled data | XGBoost Tuned with under sampled data | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.974 | 0.838 | 0.854 | 0.995 | 1.000 | 0.962 | 0.999 | 1.000 | 0.997 | 0.994 | 0.993 | 0.995 | 0.999 | 0.996 | 0.995 |
| Recall | 0.593 | 0.917 | 0.902 | 0.909 | 1.000 | 0.931 | 0.973 | 1.000 | 0.994 | 0.912 | 0.992 | 0.992 | 1.000 | 1.000 | 1.000 |
| Precision | 0.904 | 0.792 | 0.823 | 1.000 | 0.999 | 0.993 | 1.000 | 1.000 | 0.999 | 0.978 | 0.993 | 0.998 | 0.987 | 0.993 | 0.990 |
| F1 | 0.717 | 0.850 | 0.861 | 0.952 | 0.999 | 0.961 | 0.986 | 1.000 | 0.997 | 0.944 | 0.993 | 0.995 | 0.993 | 0.996 | 0.995 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
dtree_grid_val.T,
dtree_grid_val_over.T,
dtree_grid_val_under.T,
rf_grid_val.T,
rf_grid_val_over.T,
rf_grid_val_under.T,
bagging_grid_val.T,
bagging_grid_val_over.T,
bagging_grid_val_under.T,
gboost_grid_val.T,
gboost_grid_val_over.T,
gboost_grid_val_under.T,
xgboost_grid_val.T,
xgboost_grid_val_over.T,
xgboost_grid_val_under.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Decision Tree Tuned with original data",
"Decision Tree Tuned with over sampled data",
"Decision Tree Tuned with under sampled data",
"Random Forest Tuned with original data",
"Random Forest Tuned with over sampled data",
"Random Forest Tuned with under sampled data",
"Bagging Tuned with original data",
"Bagging Tuned with over sampled data",
"Bagging Tuned with under sampled data",
"Gradient Boost Tuned with original data",
"Gradient Boost Tuned with over sampled data",
"Gradient Boost Tuned with under sampled data",
"XGBoost Tuned with original data",
"XGBoost Tuned with over sampled data",
"XGBoost Tuned with under sampled data"
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Decision Tree Tuned with original data | Decision Tree Tuned with over sampled data | Decision Tree Tuned with under sampled data | Random Forest Tuned with original data | Random Forest Tuned with over sampled data | Random Forest Tuned with under sampled data | Bagging Tuned with original data | Bagging Tuned with over sampled data | Bagging Tuned with under sampled data | Gradient Boost Tuned with original data | Gradient Boost Tuned with over sampled data | Gradient Boost Tuned with under sampled data | XGBoost Tuned with original data | XGBoost Tuned with over sampled data | XGBoost Tuned with under sampled data | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.969 | 0.752 | 0.768 | 0.985 | 0.988 | 0.935 | 0.984 | 0.984 | 0.943 | 0.978 | 0.969 | 0.935 | 0.989 | 0.974 | 0.869 |
| Recall | 0.577 | 0.874 | 0.878 | 0.734 | 0.856 | 0.883 | 0.739 | 0.860 | 0.896 | 0.766 | 0.842 | 0.883 | 0.856 | 0.878 | 0.923 |
| Precision | 0.810 | 0.168 | 0.178 | 0.982 | 0.918 | 0.457 | 0.965 | 0.857 | 0.491 | 0.829 | 0.678 | 0.456 | 0.931 | 0.712 | 0.287 |
| F1 | 0.674 | 0.281 | 0.296 | 0.840 | 0.886 | 0.602 | 0.837 | 0.858 | 0.635 | 0.796 | 0.751 | 0.601 | 0.892 | 0.786 | 0.438 |
Comparing model performance, we prioritize recall, since an unpredicted failure is the most expensive scenario. The final model is chosen on recall and F1 score, comparing training performance against validation performance. The model that performs well and may be the best candidate for production is Gradient Boost with over sampling: its metrics show no major signs of overfitting, and both its recall and F1 scores are high.
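The prioritization described above can also be applied programmatically. A small sketch using two of the validation rows (values copied from the comparison table above):

```python
import pandas as pd

# Validation metrics for two candidate models, copied from the comparison table
val = pd.DataFrame(
    {
        "Gradient Boost Tuned with over sampled data": [0.969, 0.842, 0.678, 0.751],
        "XGBoost Tuned with under sampled data": [0.869, 0.923, 0.287, 0.438],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# Rank models by recall, breaking ties with F1. Recall alone can mislead:
# a model with collapsed precision (e.g. 0.287) raises many false alarms,
# which means many unnecessary inspections.
ranked = val.T.sort_values(["Recall", "F1"], ascending=False)
```

This is why F1 is consulted alongside recall when picking the production model rather than simply taking the highest-recall row.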
# Testing the best model on the test data set
gboost_over_test = model_performance_classification_sklearn(gboost_tuned_over, X_test, y_test)
print("Test Performance:")
gboost_over_test
Test Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.965 | 0.826 | 0.645 | 0.725 |
confusion_matrix_sklearn(gboost_tuned_over, X_test, y_test)
Evaluated on the test set, the final model produced a recall of approximately 0.83 and an F1 score of approximately 0.73. The confusion matrix shown above also indicates good performance, with unpredicted failures accounting for less than 1% of observations.
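All models in this notebook are scored at the default 0.5 probability threshold. Since missed failures are the costliest outcome (and inspection is cheap relative to replacement), recall could be pushed higher by lowering the decision threshold on `predict_proba`. A sketch on synthetic data (an assumption, not a step from the notebook):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced data standing in for the real sensor features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 1.0).astype(int)

clf = GradientBoostingClassifier(random_state=1).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]
threshold = 0.3  # below the default 0.5: flags more potential failures
flagged = (proba >= threshold).astype(int)
```

Any threshold below 0.5 flags at least as many generators as the default `predict`, trading precision (more inspections) for recall (fewer missed failures).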
feature_names = X.columns
importances = gboost_tuned_over.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Analyzing the key predictive features of the final model, V39 and V18 stand out as the most important. V34, V26, and V11 are also notable variables that should be monitored.
# creating the list of predictor variables V1 through V40
features = [f"V{i}" for i in range(1, 41)]
# transformer to replace missing values with the median
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
preprocessor = ColumnTransformer(
transformers=[
("variables", numeric_transformer, features)],
remainder="passthrough",
)
#Separate X and y
X = df1.drop("Target", axis=1)
y = df1["Target"]
# Splitting data into train and test 80:20
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(16000, 40) (4000, 40)
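The `stratify=y` argument is what keeps the failure rate identical in the train and test splits, which the ratio check below confirms. A small self-contained illustration (toy data, not the ReneWind set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 10% positives, mimicking the rare-failure setting
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 10% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)
```

Without stratification, a random 80:20 split of a rare class can leave the test set with too few failures to evaluate recall reliably.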
#Checking target ratio in train and test
print(y_train.value_counts() / y_train.count())
print("-" * 30)
print(y_test.value_counts() / y_test.count())
0    0.945
1    0.056
Name: Target, dtype: float64
------------------------------
0    0.945
1    0.056
Name: Target, dtype: float64
# Pipeline with the best found model
model = Pipeline(
    steps=[
        ("pre", preprocessor),
        (
            "GBM",
            GradientBoostingClassifier(
                subsample=0.7,
                n_estimators=125,
                max_features=0.5,
                learning_rate=1,
                random_state=1,
            ),
        ),
    ]
)
# Fit the model on training data
model.fit(X_train, y_train)
# checking performance on test data
model_test = model_performance_classification_sklearn(model, X_test, y_test)
model_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.973 | 0.721 | 0.773 | 0.746 |
From our analysis, we built and tested various models and chose to move forward with a Gradient Boosting classifier built on over-sampled data. The model was chosen for its recall performance (given the most weight because of the cost of an unpredicted failure) and for its comparable metrics between the training and test sets.
EDA showed that most of the collected data resembled a normal distribution.
From the model, we found that V39 and V18 are key factors in failure prediction; V34, V26, and V11 are also notable. Due to confidentiality, the meanings of these features are unknown, but the recommendation is to explore these factors of degradation: monitor them specifically and make repairs when needed to avoid the cost of replacing equipment.
In addition to the model itself, a production pipeline has been constructed for easy analysis going forward.
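For the pipeline to be used in production, it also needs to be persisted after fitting. A sketch of one common approach using `joblib` (the filename and the small stand-in data are assumptions, not part of the notebook):

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Small stand-in for the notebook's real training data, including a missing
# value so the imputer step is exercised
X_demo = np.array([[0.0], [1.0], [2.0], [np.nan], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("GBM", GradientBoostingClassifier(random_state=1)),
    ]
).fit(X_demo, y_demo)

# Persist and reload the fitted pipeline ("renewind_pipeline.joblib" is a
# hypothetical filename)
joblib.dump(pipe, "renewind_pipeline.joblib")
reloaded = joblib.load("renewind_pipeline.joblib")
```

Because imputation lives inside the pipeline, the reloaded object can score raw sensor data with missing values directly, with no separate preprocessing step to keep in sync.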